There are (at least) three approaches to quantifying information. The first, algorithmic information
or Kolmogorov complexity, takes events as strings and, given a universal Turing machine, quantifies
the information content of a string as the length of the shortest program producing it [1]. The
second, Shannon information, takes events as belonging to ensembles and quantifies the information
resulting from observing the given event in terms of the number of alternate events that have been
ruled out [2]. The third, statistical learning theory, has introduced measures of capacity that control
(in part) the expected risk of classifiers [3]. These capacities quantify the expectations regarding
future data that learning algorithms embed into classifiers.
Solomonoff and Hutter have applied algorithmic information to prove remarkable results on universal
induction. Shannon information provides the mathematical foundation for communication
and coding theory. However, both approaches have shortcomings. Algorithmic information is not
computable, severely limiting its practical usefulness. Shannon information refers to ensembles
rather than actual events: it makes no sense to compute the Shannon information of a single string
- or rather, there are many answers to this question depending on how a related ensemble is constructed.
Although there are asymptotic results linking algorithmic and Shannon information, it is
unsatisfying that there is such a large gap - a difference in kind - between the two measures.
This note describes a new method of quantifying information, effective information, that links
algorithmic information to Shannon information, and also links both to capacities arising in statistical
learning theory [4, 5]. After introducing the measure, we show that it provides a non-universal
analog of Kolmogorov complexity. We then apply it to derive basic capacities in statistical learning
theory: empirical VC-entropy and empirical Rademacher complexity. A nice byproduct of our
approach is an interpretation of the explanatory power of a learning algorithm in terms of the number
of hypotheses it falsifies [6], counted in two different ways for the two capacities. We also discuss
how effective information relates to information gain, Shannon and mutual information.



Effective information 



Any physical system, at any spatiotemporal scale, is an input/output device. For simplicity, we only
model memoryless systems with finite input and output alphabets $\mathcal{X}$ and $\mathcal{Y}$. The
probability that system $m$ outputs $y \in \mathcal{Y}$ given input $x \in \mathcal{X}$ is encoded in
Markov matrix $p_m(y|x)$.

The effective information generated when system $m$ outputs $y$ is computed as follows. First, let
the potential repertoire $p_{unif}(X)$ be the input set equipped with the uniform distribution. Next,
compute the actual repertoire via Bayes' rule

$$p_m(x|y) := p_m(y|do(x)) \cdot \frac{p_{unif}(x)}{p_m(y)}, \tag{1}$$

where $p_m(y) = \sum_x p_m(y|do(x)) \cdot p_{unif}(x)$ and $do(\cdot)$ refers to Pearl's interventional calculus [7].
Effective information is the Kullback-Leibler divergence between the two repertoires

$$ei(m, y) := D\big[p_m(X|y) \,\|\, p_{unif}(X)\big]. \tag{2}$$

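For a deterministic function $f : \mathcal{X} \to \mathcal{Y}$, the actual repertoire and effective information are

$$p_f(x|y) = \begin{cases} \frac{1}{|f^{-1}(y)|} & \text{if } f(x) = y \\ 0 & \text{else} \end{cases} \quad \text{and} \quad ei(f, y) = \log_2 |\mathcal{X}| - \log_2 |f^{-1}(y)|. \tag{3}$$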
The support of the actual repertoire is the pre-image $f^{-1}(y)$. Elements in the pre-image all have
the same probability since they cannot be distinguished by the function $f$. Effective information
quantifies the size of the pre-image relative to the input set: the smaller ("sharper") the pre-image,
the higher $ei$.
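The computation in Eqs. (1)-(3) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the list-of-rows encoding of the Markov matrix, and the four-input example function are all assumptions made for the sketch.

```python
import math

def actual_repertoire(p_y_given_x, y):
    """Eq. (1): posterior over inputs given output y, under a uniform prior.

    p_y_given_x[x][y] is assumed to hold p_m(y | do(x)).
    """
    n_inputs = len(p_y_given_x)
    prior = 1.0 / n_inputs                      # potential repertoire p_unif
    p_y = sum(row[y] * prior for row in p_y_given_x)
    if p_y == 0:
        raise ValueError("output y is never produced by the system")
    return [row[y] * prior / p_y for row in p_y_given_x]

def effective_information(p_y_given_x, y):
    """Eq. (2): KL divergence, in bits, of the actual from the potential repertoire."""
    n_inputs = len(p_y_given_x)
    posterior = actual_repertoire(p_y_given_x, y)
    # Dividing by the uniform prior 1/n is the same as multiplying by n.
    return sum(p * math.log2(p * n_inputs) for p in posterior if p > 0)

# Hypothetical deterministic system: inputs 0, 1, 2 map to output 0, input 3 to output 1.
f = [0, 0, 0, 1]
matrix = [[1.0 if f[x] == y else 0.0 for y in range(2)] for x in range(4)]

ei_blunt = effective_information(matrix, 0)   # pre-image of size 3
ei_sharp = effective_information(matrix, 1)   # pre-image of size 1
print(ei_blunt, ei_sharp)
```

For a deterministic matrix the result agrees with the closed form in Eq. (3): the output with the single-element pre-image generates log2(4) = 2 bits, while the output with the three-element pre-image generates log2(4) - log2(3) bits.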



Algorithmic information 

We show that effective information is a non-universal analog of Kolmogorov complexity. Given
universal Turing machine $T$, the (unnormalized) Solomonoff prior probability of string $s$ is

$$p_T(s) := \sum_{\{i \,|\, T(i) = s\bullet\}} 2^{-len(i)}, \tag{4}$$

where the sum is over strings $i$ that cause $T$ to output $s$ as a prefix, where no proper prefix of $i$ outputs
$s$, and $len(i)$ is the length of $i$. Kolmogorov complexity is $K(s) := -\log_2 p_T(s)$. Kolmogorov
complexity is usually defined as the length of the shortest program on a universal prefix machine that produces $s$.
The two definitions coincide up to additive constant by Levin's Coding Theorem 0]. 

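Replace universal Turing machine $T$ with deterministic system $f : \mathcal{X} \to \mathcal{Y}$. All inputs have
$len(x) = \log_2 |\mathcal{X}|$ in the optimal code for the uniform distribution on $\mathcal{X}$. Define the effective
probability of $y$ as

$$p_f(y) := \begin{cases} \sum_{\{x \,|\, f(x) = y\}} 2^{-len(x)} & \text{if } y \in f(\mathcal{X}) \\ 0 & \text{else.} \end{cases} \tag{5}$$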

Note that $p_f(y)$ is a special case of $p_m(y)$, as defined after Eq. (1). The effective distribution is thus
a non-universal analog of the Solomonoff prior, since it is computed by replacing universal Turing
machine $T$ in Eq. (4) with deterministic physical system $f : \mathcal{X} \to \mathcal{Y}$.

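In the deterministic case, each of the $|f^{-1}(y)|$ inputs in the pre-image contributes $2^{-len(x)} = 1/|\mathcal{X}|$,
so $p_f(y) = |f^{-1}(y)| / |\mathcal{X}|$ and effective information turns out to be $ei(f, y) = -\log_2 p_f(y)$,
analogously to Kolmogorov complexity. Effective information is non-universal, but computable, since it
depends on the choice of $f$.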
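As a small numerical check of the relation between effective probability and effective information, the sketch below evaluates both for a toy function. The function `f(x) = x % 3` on eight inputs and the helper names are hypothetical choices made only for illustration.

```python
import math
from collections import Counter

def effective_distribution(f, inputs):
    """Effective probability of each output: sum of 2^(-len(x)) over the pre-image.

    Every input gets code length log2 |X|, the optimal code for the uniform distribution.
    Outputs outside the image of f are simply absent (probability 0).
    """
    code_length = math.log2(len(inputs))
    weight = 2.0 ** (-code_length)              # equals 1 / |X|
    counts = Counter(f(x) for x in inputs)
    return {y: n * weight for y, n in counts.items()}

def effective_information(f, inputs, y):
    """Deterministic case: log2 |X| - log2 |pre-image of y| (y must lie in the image of f)."""
    preimage_size = sum(1 for x in inputs if f(x) == y)
    return math.log2(len(inputs)) - math.log2(preimage_size)

inputs = list(range(8))
f = lambda x: x % 3                             # hypothetical deterministic system

p_f = effective_distribution(f, inputs)
for y in sorted(p_f):
    # The two columns agree: ei(f, y) = -log2 p_f(y).
    print(y, effective_information(f, inputs, y), -math.log2(p_f[y]))
```

Unlike the Solomonoff prior, every quantity here is obtained by a finite enumeration of the inputs, which is what makes the analog computable.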

Statistical learning theory 

This section uses a particular deterministic function, learning algorithm $\mathcal{L}_{\mathcal{F},\mathcal{D}}$, to connect effective
information and the effective distribution to statistical learning theory.

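Given finite set $\mathcal{X}$, let hypothesis space $\Sigma_{\mathcal{X}} = \{\sigma : \mathcal{X} \to \pm 1\}$ contain all labelings of elements of
$\mathcal{X}$. Now, given a set of functions $\mathcal{F} \subset \Sigma_{\mathcal{X}}$ and unlabeled data $\mathcal{D} = (d_1, \ldots, d_l) \in \mathcal{X}^l$, define learning algorithm
(empirical risk minimizer)

$$\mathcal{L}_{\mathcal{F},\mathcal{D}} : \Sigma_{\mathcal{X}} \to \mathbb{R} : \sigma \mapsto \epsilon = \min_{f \in \mathcal{F}} \frac{1}{l} \sum_{k=1}^{l} \mathbb{I}\big[f(d_k) \neq \sigma(d_k)\big]. \tag{6}$$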

The learning algorithm takes a labeling of the data as input and outputs the empirical risk of the
function that best fits the data. We drop subscripts from the notation $\mathcal{L}$ below.

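Define the empirical VC-entropy of $\mathcal{F}$ on $\mathcal{D}$ as $V(\mathcal{F}, \mathcal{D}) := \log_2 |q_{\mathcal{D}}(\mathcal{F})|$ where
$q_{\mathcal{D}} : \mathcal{F} \to \mathbb{R}^l : f \mapsto (f(d_1), \ldots, f(d_l))$. Also define empirical Rademacher complexity as

$$R(\mathcal{F}, \mathcal{D}) := E_{\sigma}\Big[\sup_{f \in \mathcal{F}} \frac{1}{l} \sum_{k=1}^{l} \sigma(d_k) \cdot f(d_k)\Big],$$

where the expectation is over labelings $\sigma$ drawn uniformly from $\Sigma_{\mathcal{X}}$.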
These capacities can be used to bound the expected risk of classifiers, see El[9l for details. The 
following propositions are proved in @: 

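Proposition 1 (effective information "is" empirical VC-entropy).

$$ei(\mathcal{L}, 0) = -\log_2 p_{\mathcal{L}}(0) = l - V(\mathcal{F}, \mathcal{D})$$

Proposition 2 (expectation over $p_{\mathcal{L}}(\epsilon)$ "is" empirical Rademacher complexity).

$$E[\epsilon \,|\, p_{\mathcal{L}}] = \sum_{\epsilon} \epsilon \cdot p_{\mathcal{L}}(\epsilon) = \frac{1}{2}\big(1 - R(\mathcal{F}, \mathcal{D})\big)$$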
Thus, replacing the universal Turing machine with learning algorithm $\mathcal{L}_{\mathcal{F},\mathcal{D}}$, we obtain that our
analog of Kolmogorov complexity, the effective information of output $\epsilon = 0$, is essentially empirical
VC-entropy. Moreover, the expectation of the analog of the Solomonoff distribution is essentially
Rademacher complexity.
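Proposition 1 can be checked by brute force on a small example. The sketch below is only an illustration under stated assumptions: the four-point input set, the three data points, and the class of threshold classifiers are hypothetical choices, and the data points are assumed distinct.

```python
import math
from itertools import product

# Toy setup (illustrative assumptions, not taken from the text).
X = [0, 1, 2, 3]          # finite input set; elements double as tuple indices
D = [0, 1, 2]             # unlabeled data: l = 3 distinct points of X
l = len(D)

# Function class F: threshold classifiers f_t(x) = +1 if x >= t else -1.
F = [tuple(1 if x >= t else -1 for x in X) for t in range(len(X) + 1)]

# Hypothesis space Sigma_X: every labeling of X by -1/+1.
Sigma = list(product([-1, 1], repeat=len(X)))

def learning_algorithm(sigma):
    """Empirical risk minimizer: best achievable fraction of disagreements on D."""
    return min(sum(f[d] != sigma[d] for d in D) / l for f in F)

# Effective information of output 0: log2 |Sigma_X| - log2 |pre-image of 0|.
fitted = [s for s in Sigma if learning_algorithm(s) == 0]
ei_zero = math.log2(len(Sigma)) - math.log2(len(fitted))

# Empirical VC-entropy: log2 of the number of distinct label patterns F induces on D.
patterns = {tuple(f[d] for d in D) for f in F}
vc_entropy = math.log2(len(patterns))

print(ei_zero, l - vc_entropy)   # the two values coincide
```

In this example 8 of the 16 labelings are fit perfectly and the thresholds induce 4 patterns on the data, so both sides equal 1 bit.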

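The two quantities $ei(\mathcal{L}, 0)$ and $E[\epsilon \,|\, p_{\mathcal{L}}]$ are measures of explanatory power: as they increase,
expected future performance improves. By Eq. (3), the effective information generated by $\mathcal{L}$ is

$$ei(\mathcal{L}, 0) = \underbrace{\log_2 |\Sigma_{\mathcal{X}}|}_{\text{total \# hypotheses}} - \underbrace{\log_2 |\mathcal{L}^{-1}(0)|}_{\text{\# hypotheses } \mathcal{L} \text{ fits}} = \big(\text{\# hypotheses } \mathcal{L} \text{ falsifies}\big), \tag{7}$$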

where hypotheses are counted after taking logarithms. Effective information, which relates to VC-entropy,
counts the number of hypotheses the learning algorithm falsifies when it fits labels perfectly, without
taking into account how often they are wrong. Similarly, see [5] for details, the expectation is

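$$E[\epsilon \,|\, p_{\mathcal{L}}] = \sum_{\epsilon} p_{\mathcal{L}}(\epsilon) \cdot \epsilon = \sum_{\epsilon} \big(\text{fraction of hypotheses } \mathcal{L} \text{ falsifies}\big) \cdot \big(\text{on fraction } \epsilon \text{ of the data}\big). \tag{8}$$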

Expected $\epsilon$, which relates to Rademacher complexity, looks at the average behavior of the learning
algorithm, averaging over the fractions of hypotheses falsified, weighted by how much of the data
they are falsified on.
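The weighted count in Eq. (8) and its relation to Rademacher complexity can likewise be verified by enumeration. The sketch below reuses a hypothetical threshold-classifier setup and assumes the normalization $R = E_\sigma[\sup_f \frac{1}{l}\sum_k \sigma(d_k) f(d_k)]$ with labelings drawn uniformly; other texts scale Rademacher complexity differently, in which case the constants change accordingly.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Toy setup (illustrative assumptions, not taken from the text).
X = [0, 1, 2, 3]
D = [0, 1, 2]
l = len(D)
F = [tuple(1 if x >= t else -1 for x in X) for t in range(len(X) + 1)]
Sigma = list(product([-1, 1], repeat=len(X)))

def learning_algorithm(sigma):
    """Best achievable empirical risk on D, kept as an exact fraction."""
    return min(Fraction(sum(f[d] != sigma[d] for d in D), l) for f in F)

# Effective distribution: fraction of hypotheses whose best fit errs on fraction eps of D.
counts = Counter(learning_algorithm(s) for s in Sigma)
p_L = {eps: Fraction(n, len(Sigma)) for eps, n in counts.items()}

# Eq. (8): fractions of hypotheses falsified, weighted by how much data they fail on.
expected_eps = sum(eps * p for eps, p in p_L.items())

def best_correlation(sigma):
    """sup over F of the average agreement (1/l) * sum_k sigma(d_k) * f(d_k)."""
    return max(Fraction(sum(f[d] * sigma[d] for d in D), l) for f in F)

# Empirical Rademacher complexity under the normalization assumed above.
rademacher = sum(best_correlation(s) for s in Sigma) / len(Sigma)

print(expected_eps, (1 - rademacher) / 2)   # the two values coincide
```

Exact rational arithmetic is used so that the identity in Proposition 2 can be compared with `==` rather than a floating-point tolerance.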

The bounds proved in [3, 8, 9], which control the expected future performance of the classifier
minimizing empirical risk, can therefore be rephrased in terms of the number of hypotheses falsified by
the learning algorithm, Eqs. (7) and (8), suggesting a possible route towards rigorously grounding
the role of falsification in science [5].

Shannon information 

We relate effective information to Shannon and mutual information. 

Suppose we have model $m$ that generates data $d \in D$ with probability $p_m(d|h)$ given hypothesis
$h \in H$. For prior distribution $p(H)$ on hypotheses, the information gained by observing $d$ is

$$D\big[p_m(H|d) \,\|\, p(H)\big]. \tag{9}$$

Kullback-Leibler divergence $D[p \,\|\, q]$ can be interpreted as the number of Y/N questions required to
get from $q$ to $p$. Thus, Eq. (9) quantifies how many Y/N questions the model answers about the
hypotheses using the data.

Effective information, Eq. (2), quantifies the information gained when physical system $m$ outputs
$y$. Rather than inferring on hypotheses, the system, by producing an output, specifies probabilistic
constraints on what its input must have been. Effective information uses the uniform (maximum
entropy) prior since any other prior would insert additional data not belonging to the system: the
prior is something else, on top of $m$. However, this restriction is not essential and will be dropped
for the remainder of this section.

Consider the following scenario. Suppose $\mathcal{X}$ and $\mathcal{X}'$ are isomorphic, and deterministic physical
system $c : \mathcal{X} \to \mathcal{X}'$ copies its inputs, mapping $x_k \mapsto x'_k$ for example. Given prior $p(X)$, the
effective information generated is

$$ei(p(X), c, x'_k) := D\big[p_c(X|x'_k) \,\|\, p(X)\big] = D\big[\delta_{x_k} \,\|\, p(X)\big] = -\log_2 p(x_k),$$

the surprise of $x_k$. It follows that Shannon information is expected effective information

$$H(X) = E\big[ei(p(X), c, x'_k) \,|\, p_c(X')\big].$$



More generally, if we are given noisy memoryless channel $m$ from $\mathcal{X}$ to $\mathcal{Y}$ with distribution $p(X)$
on $\mathcal{X}$, then mutual information is the expectation

$$I(X; Y) = E\big[ei(p(X), m, y) \,|\, p_m(Y)\big],$$

where $p_m(y) = \sum_x p_m(y|do(x)) \cdot p(x)$ is the effective distribution on $\mathcal{Y}$. Thus, Shannon and mutual
information are simply averages of effective information, our non-universal analog of Kolmogorov
complexity.
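Both averaging statements can be checked numerically. The sketch below is a hedged illustration: the three-symbol prior, the copy channel, and the noisy channel are hypothetical, and mutual information is computed independently as H(Y) - H(Y|X) for comparison.

```python
import math

def effective_information(prior, channel, y):
    """Divergence of the posterior over inputs from the prior, in bits.

    channel[x][y] is assumed to hold p_m(y | do(x)); y must have nonzero probability.
    """
    p_y = sum(px * row[y] for px, row in zip(prior, channel))
    ei = 0.0
    for px, row in zip(prior, channel):
        posterior = px * row[y] / p_y          # Bayes' rule with a general prior
        if posterior > 0:
            ei += posterior * math.log2(posterior / px)
    return ei

def expected_ei(prior, channel):
    """Average effective information over the effective distribution on outputs."""
    total = 0.0
    for y in range(len(channel[0])):
        p_y = sum(px * row[y] for px, row in zip(prior, channel))
        if p_y > 0:
            total += p_y * effective_information(prior, channel, y)
    return total

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mutual_information(prior, channel):
    """Textbook I(X;Y) = H(Y) - H(Y|X), computed without reference to ei."""
    p_y = [sum(px * row[y] for px, row in zip(prior, channel))
           for y in range(len(channel[0]))]
    h_y_given_x = sum(px * entropy(row) for px, row in zip(prior, channel))
    return entropy(p_y) - h_y_given_x

prior = [0.5, 0.25, 0.25]                                   # hypothetical p(X)
copy_channel = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy_channel = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

print(expected_ei(prior, copy_channel), entropy(prior))
print(expected_ei(prior, noisy_channel), mutual_information(prior, noisy_channel))
```

For the copy channel each output generates exactly the surprise of the copied input, so the average is the entropy of the prior; for the noisy channel the average matches the independently computed mutual information up to floating-point error.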

Finally, interpreting effective information as information gain, Eq. (9), and combining with results
from the previous section shows that ($l$ minus empirical VC-entropy) is the information we gain about
the set $\Sigma_{\mathcal{X}}$ of hypotheses when told that learning algorithm $\mathcal{L}_{\mathcal{F},\mathcal{D}}$ fits the labeled data perfectly.
Discussion



This note starts from the observation that all physical systems classify inputs and thereby generate
information. A deterministic physical system $f : \mathcal{X} \to \mathcal{Y}$ implicitly categorizes its inputs by
assigning them to outputs: the category assigned to output $y$ is the set of inputs in the pre-image
$f^{-1}(y) \subset \mathcal{X}$. The intuition carries through in the probabilistic case after replacing pre-images with
actual repertoires. Effective information then quantifies the sharpness of categories: the sharper a
category, the more informative the corresponding output. Alternatively, effective information
quantifies causal dependencies: outputs with high $ei$ are extremely sensitive to changes in the input.

Effective information is a concrete, computable analog of Kolmogorov complexity. The Kolmogorov
complexity of a string quantifies the "work" required to produce it; roughly, the length
of the programs that output it. Since universal Turing machines require infinite storage space and
are therefore impossible to construct, it is unclear how relevant they are to processes actually
occurring in nature. Effective information substitutes a deterministic model of a physical system in
place of the universal Turing machine, and quantifies the "work" required to produce an output as
the number of Y/N decisions required to choose it.

Both Shannon and mutual information arise as expectations of effective information after tweaking
to get rid of the uniform prior. The difference between Kolmogorov complexity and Shannon
information reduces to: (i) replacing a universal Turing machine with a specific system (channel) and (ii)
computing the average information gain over all outputs, rather than a single one.

When the physical process under consideration is empirical risk minimization, the effective
information it generates contributes to bounds on expected risk. In particular, the work (the number of
Y/N decisions) required to fit data $\mathcal{D}$ using functions in $\mathcal{F}$ essentially is the empirical VC-entropy.
Since finding the optimal classifier in $\mathcal{F}$ requires computing $\mathcal{L}_{\mathcal{F},\mathcal{D}}$ in some way or another, thereby
implementing it physically, it follows that the effective information generated while fitting data has
implications for the future performance of classifiers, see [3, 8, 9].

Effective information and the expected risk over the effective distribution also provide new
interpretations of VC-entropy and Rademacher complexity in terms of falsifying hypotheses, see Eqs. (7)
and (8), and also [10] for a comparison of falsification with VC-dimension. Viewing empirical
risk minimization as a physical process that classifies hypotheses according to fit $\epsilon$ thus directly
links VC-entropy and Rademacher complexity with Popper's proposal that the power of a scientific
theory lies in how many hypotheses it rules out, rather than the amount of data it explains [6].

The links with Kolmogorov complexity, learning theory, information gain and falsification shown 
above suggest it is worth investigating whether the effective information generated while optimizing 
quantities other than empirical risk (e.g. margins) has implications for future performance. 